Steps Towards More Natural Human-Machine Interaction via Audio-Visual Word Prominence Detection
Abstract
We investigate how word prominence can be detected from the acoustic signal and from the movements of the speaker’s head and mouth. Our study is based on a corpus of 12 English speakers that contains, in addition to the speech signal, video recordings of the talker’s head. To extract word prominence information, we use functionals calculated on the features on the one hand and Functional PCA (FPCA) of the feature contours on the other. Combining the functionals with the contour information, we obtain a discrimination accuracy of 81% between prominent and non-prominent words. In particular, we show that the visual channel is very informative for some speakers. We also introduce a system that extracts prominence information online while a user is interacting with it; this online system uses only acoustic information.
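The combination of functionals and contour information described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of functionals, the fixed-length resampling, and all function names (`functionals`, `resample`, `fit_contour_pca`, `word_features`) are assumptions made for illustration, and ordinary PCA on resampled contours stands in here as a simple discretized approximation of FPCA.

```python
import numpy as np

def functionals(contour):
    """Summary statistics of one per-word feature contour (assumed set)."""
    c = np.asarray(contour, dtype=float)
    return np.array([c.mean(), c.std(), c.min(), c.max(), c.max() - c.min()])

def resample(contour, n_points=20):
    """Linearly resample a contour to a fixed length so words are comparable."""
    c = np.asarray(contour, dtype=float)
    old_t = np.linspace(0.0, 1.0, len(c))
    new_t = np.linspace(0.0, 1.0, n_points)
    return np.interp(new_t, old_t, c)

def fit_contour_pca(contours, n_components=3, n_points=20):
    """Estimate the mean contour and leading principal components.

    PCA on time-normalized contours is used as a stand-in for FPCA.
    """
    X = np.stack([resample(c, n_points) for c in contours])
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def word_features(contour, mean, components):
    """Concatenate functionals with the contour's principal component scores."""
    scores = components @ (resample(contour, len(mean)) - mean)
    return np.concatenate([functionals(contour), scores])
```

In such a setup, the vector returned by `word_features` for each word (for example from a fundamental frequency or head movement contour) would be passed to a binary classifier that separates prominent from non-prominent words.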
Similar resources
Integrating sequence information in the audio-visual detection of word prominence in a human-machine interaction scenario
Modifying the articulatory parameters to raise the prominence of a segment of an utterance (hyperarticulating) is usually accompanied by a reduction of these parameters (hypoarticulation) for the neighboring segments. In this paper we investigate different approaches for the automatic labeling of the prominence of words. In particular, we investigate how the information in the sequence can be u...
Audio-visual Evaluation and Detection of Word Prominence in a Human-Machine Interaction Scenario
This paper investigates the audio-visual correlates and the detection of word prominence. Subjects were interacting with a computer in a small game which created a broad and a narrow focus condition. Audio-visual recordings with a distant microphone and without visual markers were made. As acoustic features duration, intensity, fundamental frequency and spectral emphasis were calculated. From t...
Feature-Level Decision Fusion for Audio-Visual Word Prominence Detection
Common fusion techniques in audio-visual speech processing operate on the modality level. That is, they either combine the features extracted from the two modalities directly or derive a decision for each modality separately and then combine the modalities on the decision level. We investigate the audio-visual processing of linguistic prosody, more precisely the extraction of word prominence. In th...
Learning words from natural audio-visual input
We present a model of early word learning which learns from natural audio and visual input. The model has been successfully implemented to learn words and their audio-visual grounding from camera and microphone input. Although simple in its current form, this model is a first step towards a more complete, fully-grounded model of language acquisition. Practical applications include adaptive human-...
Differences between Speakers in Audio-visual Classification of Word Prominence
We show how the audio-visual discrimination performance of prominent from non-prominent words based on an SVM classifier varies from speaker to speaker. We collected data in an experiment where users were interacting via speech in a small game, designed as a Wizard-of-Oz experiment, with a computer. Following misunderstandings of one single word of the system, users were instructed to correct t...